{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "zbmL45h7fWRT"
      },
      "source": [
        "# Create AI Agents that work with your PDFs using Chunkr & Mistral AI\n",
        "\n",
        "You can also open this cookbook in Colab [here](https://colab.research.google.com/drive/1vkSfFUl-5oVDinKt8P0GChnaSOis57ij?usp=sharing).\n",
        "\n",
        "In this cookbook, we introduce Chunkr, a document processing API for scalable data extraction and preparation, aimed at Retrieval-Augmented Generation (RAG) workflows and large language models (LLMs). Chunkr is integrated with CAMEL. We cover its three core capabilities (Segment, OCR, and Structure), walk through processing a PDF step by step, and finish with a short demo in which a CAMEL agent summarizes the extracted content.\n",
        "\n",
        "Key tools utilized in this notebook include:\n",
        "\n",
        "*   **CAMEL**: A powerful multi-agent framework that enables Retrieval-Augmented Generation and multi-agent role-playing scenarios, allowing for sophisticated AI-driven tasks.\n",
        "*   **Chunkr**: A powerful document processing API built for efficient and scalable data extraction and preparation, perfect for Retrieval-Augmented Generation (RAG) workflows and large language models (LLMs).\n",
        "*   **Mistral AI**: A series of high-performance LLMs.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Table of Contents:\n",
        "\n",
        "1.  🧑🏻‍💻 Introduction\n",
        "\n",
        "2.  ⚡️ Step-by-step Guide of Digesting PDFs with Chunkr\n",
        "\n",
        "3.  💫 Quick Demo with CAMEL Agent\n",
        "\n",
        "4.  🧑🏻‍💻 Conclusion\n",
        "\n",
        "\n",
        "To run this, click \"*Runtime*\" and then \"*Run all*\" on a **free** Tesla T4 Google Colab instance!\n",
        "<div class=\"align-center\">\n",
        "  <a href=\"https://www.camel-ai.org/\"><img src=\"https://i.postimg.cc/KzQ5rfBC/button.png\" width=\"150\"></a>\n",
        "  <a href=\"https://discord.camel-ai.org\"><img src=\"https://i.postimg.cc/L4wPdG9N/join-2.png\" width=\"150\"></a>\n",
        "  \n",
        "⭐ <i>Star us on [*GitHub*](https://github.com/camel-ai/camel), join our [*Discord*](https://discord.camel-ai.org) or follow our [*X*](https://x.com/camelaiorg)</i>\n",
        "</div>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Bl90urdzuQLu"
      },
      "source": [
        "## **Introduction**\n",
        "\n",
        "Chunkr is an API that processes documents and prepares them for AI applications such as RAG and LLMs. From extracting text to structuring complex layouts, it simplifies turning raw documents into structured, usable data.\n",
        "\n",
        "#### **Key Features of Chunkr:**\n",
        "\n",
        "1.  Document Segmentation:\n",
        "\n",
        "    *   Breaks down documents into coherent chunks using transformer-based models.\n",
        "\n",
        "    *   Provides a logical flow of content, maintaining the context needed for efficient data analysis.\n",
        "\n",
        "2.  Advanced OCR (Optical Character Recognition) Capabilities:\n",
        "\n",
        "    *   Extracts text and bounding boxes from images or scanned PDFs using high-precision OCR.\n",
        "\n",
        "    *   Makes content searchable, analyzable, and ready for integration into AI models.\n",
        "\n",
        "3.  Semantic Layout Analysis:\n",
        "\n",
        "    *   Detects and tags content elements like headers, paragraphs, tables, and figures.\n",
        "\n",
        "    *   Converts document layouts into structured outputs like HTML and Markdown.\n",
        "\n",
        "\n",
        "#### **Why Use Chunkr?**\n",
        "\n",
        "*   Optimized for AI: Simplifies preparing data for LLMs and other AI models.\n",
        "\n",
        "*   Multi-Format Compatibility: Processes PDFs, DOCX, PPTX, XLSX, and more.\n",
        "\n",
        "*   Scalable Deployment: Run it locally for small projects or deploy at scale with Kubernetes. It is also open source.\n",
        "\n",
        "\n",
        "In this cookbook, we focus on processing PDF files with Chunkr.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 📦 Installation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7p-JjpyNVcCT"
      },
      "source": [
        "First, install the CAMEL package with all its dependencies."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "id": "I2X5A0LBc92C"
      },
      "outputs": [],
      "source": [
        "%pip install \"camel-ai[all]==0.2.11\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WCfZ6vA0iFQv"
      },
      "source": [
        "## ⚡️ Step-by-step Guide of Digesting PDFs with Chunkr\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Step 1: Set up your [Chunkr API key](https://docs.chunkr.ai/quickstart).\n",
        "\n",
        "If you don't have a Chunkr API key, you can obtain one by following these steps:\n",
        "\n",
        "1. Create an account:\n",
        "\n",
        "Go to [chunkr.ai](https://www.chunkr.ai/) and sign up for an account.\n",
        "\n",
        "2. Get your API key:\n",
        "\n",
        "Once logged in, navigate to the API section of your account dashboard, where a new API key will be generated. Copy this key and store it securely."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "cuZwSWBYyCT8",
        "outputId": "684ab262-75aa-41af-a460-0d8e67316022"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "# Prompt for the Chunkr API key securely\n",
        "chunkr_api_key = getpass('Enter your API key: ')\n",
        "os.environ[\"CHUNKR_API_KEY\"] = chunkr_api_key"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Alternatively, if running on Colab, you can save your API keys and tokens as **Colab Secrets** and use them across notebooks.\n",
        "\n",
        "To do so, **comment out** the **manual** API key prompt code block above, and **uncomment** the following code block.\n",
        "\n",
        "⚠️ Don't forget to grant the current notebook access to the API key you are using."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# import os\n",
        "# from google.colab import userdata\n",
        "\n",
        "# os.environ[\"CHUNKR_API_KEY\"] = userdata.get(\"CHUNKR_API_KEY\")"
      ]
    },
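    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The two approaches above can be combined. The next cell is a minimal sketch rather than part of CAMEL or Chunkr: `resolve_api_key` is a hypothetical helper that looks for the key in the environment first, then in Colab Secrets (assuming the secret has the same name as the environment variable), and only then falls back to a prompt."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "\n",
        "def resolve_api_key(name):\n",
        "    # Hypothetical helper: return the key called `name` and export it\n",
        "    key = os.environ.get(name)\n",
        "    if not key:\n",
        "        try:\n",
        "            # google.colab is only importable inside Google Colab\n",
        "            from google.colab import userdata\n",
        "            key = userdata.get(name)\n",
        "        except Exception:\n",
        "            # Not on Colab, secret missing, or notebook access not granted\n",
        "            key = None\n",
        "    if not key:\n",
        "        key = getpass(f'Enter your {name}: ')\n",
        "    os.environ[name] = key\n",
        "    return key\n",
        "\n",
        "\n",
        "# Example (uncomment to use):\n",
        "# resolve_api_key('CHUNKR_API_KEY')"
      ]
    },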
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LUpFBEB9vrIz"
      },
      "source": [
        "Step 2: Let's download the example PDF file from https://arxiv.org/pdf/2303.17760.pdf and save it locally. This will be our example data.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "dUVE7z9hwEV7"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import requests\n",
        "\n",
        "os.makedirs('local_data', exist_ok=True)\n",
        "\n",
        "url = \"https://arxiv.org/pdf/2303.17760.pdf\"\n",
        "response = requests.get(url, timeout=60)\n",
        "response.raise_for_status()  # Stop early if the download failed\n",
        "with open('local_data/camel_paper.pdf', 'wb') as file:\n",
        "    file.write(response.content)"
      ]
    },
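    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before submitting the file, it can help to confirm that the download really produced a PDF, since a failed request may save an HTML error page instead. The next cell is a small sketch under that assumption: `looks_like_pdf` is a hypothetical helper that checks for the `%PDF-` marker that PDF files normally start with."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "\n",
        "def looks_like_pdf(path):\n",
        "    # Hypothetical helper: PDF files normally begin with the bytes %PDF-\n",
        "    with open(path, 'rb') as f:\n",
        "        return f.read(5) == b'%PDF-'\n",
        "\n",
        "\n",
        "pdf_path = 'local_data/camel_paper.pdf'\n",
        "if os.path.exists(pdf_path):\n",
        "    print(pdf_path, 'looks like a PDF:', looks_like_pdf(pdf_path))\n",
        "else:\n",
        "    print(pdf_path, 'not found; run the download cell first')"
      ]
    },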
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fwlB09epyeWK"
      },
      "source": [
        "Step 3: Submit a document processing task. The call returns a task ID, which is used in Step 4 to fetch the result."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "SgIsQpOX82Bo",
        "outputId": "fe7849a7-4980-421a-9f84-af4c36a37ff4"
      },
      "outputs": [],
      "source": [
        "# Importing the ChunkrReader class from the camel.loaders module\n",
        "# This class handles document processing using Chunkr's capabilities\n",
        "from camel.loaders import ChunkrReader\n",
        "import nest_asyncio\n",
        "\n",
        "# Patch asyncio so that nested event loops work inside the notebook\n",
        "nest_asyncio.apply()\n",
        "\n",
        "# Initializing an instance of ChunkrReader\n",
        "# This object will be used to submit tasks and manage document processing\n",
        "chunkr_reader = ChunkrReader()\n",
        "\n",
        "# Submitting a document processing task\n",
        "# Replace \"local_data/camel_paper.pdf\" with the path to your target document\n",
        "await chunkr_reader.submit_task(file_path=\"local_data/camel_paper.pdf\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ecWmxX4ZyoS8"
      },
      "source": [
        "Step 4: Pass the task ID returned in Step 3 to retrieve the task output.\n",
        "\n",
        "The output of Chunkr is structured text and metadata from documents, including:\n",
        "\n",
        "1.   **Formatted Content**: Text in structured formats like JSON, HTML, or Markdown.\n",
        "\n",
        "2.   **Semantic Tags**: Identifies headers, paragraphs, tables, and other elements.\n",
        "\n",
        "3.   **Bounding Box Data**: Spatial positions of text (x, y coordinates) for OCR-processed documents.\n",
        "\n",
        "4.   **Metadata**: Information like page numbers, file type, and document properties."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "rTKuFUXuyoS8",
        "outputId": "6082c541-f197-4c53-f8ef-a0c8a2f1d332"
      },
      "outputs": [],
      "source": [
        "# Retrieving the output of a previously submitted task\n",
        "# Replace the task_id below with the ID returned by your own submit_task call in Step 3\n",
        "chunkr_output = await chunkr_reader.get_task_output(task_id=\"902e686a-d6f5-413d-8a8d-241a3f43d35b\")\n",
        "print(chunkr_output)"
      ]
    },
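    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The task output is used as text in the demo below (it is sliced with `chunkr_output[:4000]`). To see what came back before handing it to an agent, a quick inspection helps. The next cell is a sketch that assumes the output may be a JSON string; `summarize_output` is a hypothetical helper, and the toy input with its `status` and `output` keys is made up for illustration, not taken from Chunkr's response schema."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "\n",
        "def summarize_output(text, preview_chars=300):\n",
        "    # Hypothetical helper: try to read the output as JSON,\n",
        "    # otherwise fall back to a plain-text preview\n",
        "    try:\n",
        "        data = json.loads(text)\n",
        "    except (TypeError, ValueError):\n",
        "        return {'format': 'text', 'preview': str(text)[:preview_chars]}\n",
        "    if isinstance(data, dict):\n",
        "        return {'format': 'json', 'keys': sorted(data.keys())}\n",
        "    return {'format': 'json', 'type': type(data).__name__}\n",
        "\n",
        "\n",
        "# Toy input (made-up keys) so the cell runs on its own;\n",
        "# pass chunkr_output instead to inspect the real result\n",
        "print(summarize_output('{\"status\": \"Succeeded\", \"output\": []}'))"
      ]
    },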
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "65Z2nDTfUM0P"
      },
      "source": [
        "## 💫 Quick Demo with CAMEL Agent\n",
        "\n",
        "Here we use a Mistral model for our demo. If you'd like to explore different models or tools to suit your needs, feel free to visit the [CAMEL documentation page](https://docs.camel-ai.org/), where you'll find guides and tutorials.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AtaiHz437A2_"
      },
      "source": [
        "If you don't have a Mistral API key, you can obtain one by following these steps:\n",
        "\n",
        "1. Visit the Mistral Console (https://console.mistral.ai/)\n",
        "\n",
        "2. In the left panel, click on **API keys** under the API section\n",
        "\n",
        "3. Choose your plan\n",
        "\n",
        "For more details, you can also check the Mistral documentation: https://docs.mistral.ai/getting-started/quickstart/"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "W8SxUAI87AOs",
        "outputId": "8b144860-8a67-4268-932b-8eccd39b74ac"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Enter your API key··········\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "mistral_api_key = getpass('Enter your API key')\n",
        "os.environ[\"MISTRAL_API_KEY\"] = mistral_api_key"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# import os\n",
        "# from google.colab import userdata\n",
        "\n",
        "# os.environ[\"MISTRAL_API_KEY\"] = userdata.get(\"MISTRAL_API_KEY\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1mKu0Dgj5TV9"
      },
      "outputs": [],
      "source": [
        "from camel.configs import MistralConfig\n",
        "from camel.models import ModelFactory\n",
        "from camel.types import ModelPlatformType, ModelType\n",
        "\n",
        "mistral_model = ModelFactory.create(\n",
        "    model_platform=ModelPlatformType.MISTRAL,\n",
        "    model_type=ModelType.MISTRAL_LARGE,\n",
        "    model_config_dict=MistralConfig(temperature=0.0).as_dict(),\n",
        ")\n",
        "\n",
        "# Use Mistral model\n",
        "model = mistral_model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "cPOelPF2UM0P",
        "outputId": "e2a5dd9f-86de-49e5-99d4-a654f7726d44"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Based on the provided content, here's a brief conclusion:\n",
            "\n",
            "The document introduces CAMEL, which stands for \"Communicative Agents for ‘Mind’ Exploration of Large Language Model Society.\" It seems to be a title of a paper or an initiative focused on the exploration of large language models, likely in the context of artificial intelligence and machine learning. The document also provides a URL, \"https://www.camel-ai.org,\" which likely leads to more information about the CAMEL initiative. Additionally, the name \"Guohao Li\" is mentioned, which could be the author or a person associated with this initiative.\n"
          ]
        }
      ],
      "source": [
        "from camel.agents import ChatAgent\n",
        "\n",
        "# Initialize a ChatAgent\n",
        "agent = ChatAgent(\n",
        "    system_message=\"You're a helpful assistant\",  # Define the agent's role or purpose\n",
        "    message_window_size=10,  # [Optional] Specifies the chat memory length\n",
        "    model=model\n",
        ")\n",
        "\n",
        "# Use the ChatAgent to generate a response based on the chunkr output\n",
        "response = agent.step(f\"based on {chunkr_output[:4000]}, give me a conclusion of the content\")\n",
        "\n",
        "# Print the content of the first message in the response, which contains the assistant's answer\n",
        "print(response.msgs[0].content)"
      ]
    },
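    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The demo above only sends the first 4,000 characters of the Chunkr output to the agent. One simple way to cover more of a document, sketched in the next cell, is to split the text into overlapping windows and query the agent once per window. `split_text` and its `size` and `overlap` parameters are hypothetical names chosen for this illustration; this is plain character slicing, not Chunkr's segmentation or CAMEL's RAG pipeline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def split_text(text, size=4000, overlap=200):\n",
        "    # Hypothetical helper: cut text into windows of at most `size`\n",
        "    # characters; consecutive windows share `overlap` characters\n",
        "    if not 0 <= overlap < size:\n",
        "        raise ValueError('overlap must be >= 0 and smaller than size')\n",
        "    if not text:\n",
        "        return []\n",
        "    step = size - overlap\n",
        "    stop = max(len(text) - overlap, 1)\n",
        "    return [text[i:i + size] for i in range(0, stop, step)]\n",
        "\n",
        "\n",
        "# Small self-contained example\n",
        "print(split_text('abcdefghij', size=4, overlap=1))\n",
        "\n",
        "# With the objects defined earlier in this notebook, one could then run:\n",
        "# for piece in split_text(chunkr_output):\n",
        "#     print(agent.step(f'Summarize this excerpt: {piece}').msgs[0].content)"
      ]
    },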
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p3VfU-hdX4Ti"
      },
      "source": [
        "**For advanced usage of RAG capabilities with large files, please refer to our [RAG cookbook](https://docs.camel-ai.org/cookbooks/agents_with_rag.html).**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sR4BU5Z_oanP"
      },
      "source": [
        "## 🌟 Highlights\n",
        "\n",
        "In conclusion, integrating Chunkr with CAMEL-AI streamlines document data extraction and preparation for AI-driven applications. With Chunkr’s Segment, OCR, and Structure features, you can process complex documents into structured, machine-readable formats optimized for LLMs and feed them directly into CAMEL-AI’s multi-agent workflows. In this notebook, we downloaded a PDF, submitted it to Chunkr, retrieved the structured output, and asked a Mistral-powered CAMEL agent to summarize it.\n",
        "\n",
        "\n",
        "Key tools utilized in this notebook include:\n",
        "\n",
        "*   **CAMEL**: A powerful multi-agent framework that enables Retrieval-Augmented Generation and multi-agent role-playing scenarios, allowing for sophisticated AI-driven tasks.\n",
        "*   **Chunkr**: An advanced document processing API built for efficient and scalable data extraction and preparation, perfect for Retrieval-Augmented Generation (RAG) workflows and large language models (LLMs).\n",
        "*   **Mistral AI**: A series of high-performance LLMs."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oA81T8-ToaWz"
      },
      "source": [
        "That's everything! Got questions about 🐫 CAMEL-AI? Join us on [Discord](https://discord.camel-ai.org)! Whether you want to share feedback, explore the latest in multi-agent systems, get support, or connect with others on exciting projects, we’d love to have you in the community! 🤝\n",
        "\n",
        "Check out some of our other work:\n",
        "\n",
        "1. 🐫 Creating Your First CAMEL Agent [free Colab](https://docs.camel-ai.org/cookbooks/create_your_first_agent.html)\n",
        "\n",
        "2. Graph RAG Cookbook [free Colab](https://colab.research.google.com/drive/1uZKQSuu0qW6ukkuSv9TukLB9bVaS1H0U?usp=sharing)\n",
        "\n",
        "3. 🧑‍⚖️ Create A Hackathon Judge Committee with Workforce [free Colab](https://colab.research.google.com/drive/18ajYUMfwDx3WyrjHow3EvUMpKQDcrLtr?usp=sharing)\n",
        "\n",
        "4. 🔥 3 ways to ingest data from websites with Firecrawl & CAMEL [free Colab](https://colab.research.google.com/drive/1lOmM3VmgR1hLwDKdeLGFve_75RFW0R9I?usp=sharing)\n",
        "\n",
        "5. 🦥 Agentic SFT Data Generation with CAMEL and Mistral Models, Fine-Tuned with Unsloth [free Colab](https://colab.research.google.com/drive/1lYgArBw7ARVPSpdwgKLYnp_NEXiNDOd-?usp=sharing)\n",
        "\n",
        "Thanks from everyone at 🐫 CAMEL-AI\n",
        "\n",
        "\n",
        "<div class=\"align-center\">\n",
        "  <a href=\"https://www.camel-ai.org/\"><img src=\"https://i.postimg.cc/KzQ5rfBC/button.png\" width=\"150\"></a>\n",
        "  <a href=\"https://discord.camel-ai.org\"><img src=\"https://i.postimg.cc/L4wPdG9N/join-2.png\" width=\"150\"></a>\n",
        "  \n",
        "⭐ <i>Star us on <a href=\"https://github.com/camel-ai/camel\">GitHub</a></i>, join our [*Discord*](https://discord.camel-ai.org) or follow our [*X*](https://x.com/camelaiorg) ⭐\n",
        "</div>\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "camel",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.8"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
